To take the normalized expression data that was created in Assignment #1 and rank the genes according to differential expression. Once the gene list is ranked, a thresholded over-representation analysis is performed to highlight dominant themes in the top set of genes.
The dataset is derived from genetically engineered Foxd1Cre;Smo(flox/-) house mice and aims to investigate the mRNA content of mutant kidneys to gain a deeper understanding of the molecular mechanisms underlying the developmental defects observed in the kidney. The Hedgehog-GLI pathway, which is responsible for the development and patterning of various tissues and organs, including the kidney, interacts with the transforming growth factor beta (TGF) signaling pathway, another critical regulator of kidney development. Smo, a key component of the Hedgehog-GLI signaling pathway, plays a crucial role in human kidney development and modulates TGFβ signaling in the kidney, and its disruption can lead to abnormal kidney development and function. The ultimate goal of understanding the role of Smo in these signaling pathways is to uncover the molecular mechanisms underlying normal kidney development and diseases, which could ultimately lead to the development of new therapeutic strategies for kidney disorders.
In the previous work on the dataset with accession ID GSE103923, an initial processing of the data has been done, including accessing the dataset quality, filtering the low expression gene. The dataset was downloaded in a normalized data with identified as a mixed of HUGO symbol and he Mouse Genome Informatics (MGI) symbol. Hence, a identifier mapping was performed, and no extra normalization was applied. The result of assignment one is a a clean, normalized dataset stored in the root directory as “smo_exp_normalized.txt”.
Tools are needed for a differential expression analysis and a threshold over-representation analysis for our normalized dataset created in Assignment 1:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
library(BiocManager)
if (!requireNamespace("GEOmetadb", quietly = TRUE))
BiocManager::install("GEOmetadb")
library(GEOmetadb)
if (!requireNamespace("biomaRt", quietly = TRUE))
BiocManager::install("biomaRt")
library(biomaRt)
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
library(edgeR)
if (!requireNamespace("circlize", quietly = TRUE))
install.packages("circlize")
library(circlize)
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
library(ComplexHeatmap)
if (!requireNamespace("gprofiler2", quietly = TRUE))
BiocManager::install("gprofiler2")
library(gprofiler2)
if (!requireNamespace("knitr", quietly = TRUE))
install.packages("knitr")
library(knitr)
if (!requireNamespace("ggplot2", quietly = TRUE))
install.packages("ggplot2")
library(ggplot2)
if (!requireNamespace("splines", quietly = TRUE))
install.packages("splines")
library(splines)
if (!requireNamespace("GEOquery", quietly = TRUE))
BiocManager::install("GEOquery")
library(GEOquery)
if (!requireNamespace("Biobase", quietly = TRUE))
BiocManager::install("Biobase")
library(Biobase)
if (!requireNamespace("dplyr", quietly = TRUE))
install.packages("dplyr")
library(dplyr)
if (!requireNamespace("kableExtra", quietly = TRUE)){
install.packages("kableExtra")}
library(kableExtra)
The result dataset from assignment 1 is stored in our root directory as “smo_exp_final.txt”
# Retreive the Assignment 1 result dataset
normalized_counts <- read.table("smo_exp_normalized.txt")
# Inspect the dataset
kable(normalized_counts[1:5,1:4], format = "html")
| MUT1 | MUT2 | MUT3 | MUT4 | |
|---|---|---|---|---|
| A2m | 8.594266 | 19.16863 | 8.961176 | 10.73999 |
| A4galt | 171.885312 | 128.85579 | 144.498960 | 128.87987 |
| Aaas | 543.348570 | 562.27980 | 541.030990 | 543.44347 |
| Aacs | 340.905869 | 363.13903 | 375.249237 | 379.12163 |
| Aagab | 550.987917 | 514.35822 | 567.914518 | 459.67155 |
kable(normalized_counts[1:5,5:8], format = "html")
| WT1 | WT2 | WT3 | WT4 | |
|---|---|---|---|---|
| A2m | 8.506106 | 18.60708 | 11.62576 | 20.68402 |
| A4galt | 127.591595 | 138.51935 | 116.25760 | 131.62556 |
| Aaas | 507.814548 | 516.86324 | 516.37751 | 546.24606 |
| Aacs | 346.198528 | 350.43328 | 351.67924 | 331.88444 |
| Aagab | 495.055389 | 528.23424 | 450.49820 | 448.46708 |
Two steps will be performed in this step: * Step 1: Create a design model to be used for calculating differential expression, * Step 2: Perform an analysis of differential expression using the normalized expression data obtained in A1.
# Calculate distances and plot MDS
dist <- dist(t(log2(cpm(normalized_counts) + 1)))
plotMDS(dist,
labels=colnames(normalized_counts),
col=ifelse(grepl("WT",colnames(normalized_counts)), "#00FFFF", "#FF00BF"),
pch=16, main="Normalized MDS plot of Smo(flox/-) mutants and wild type")
Figure 1. Multidimensional Scaling (MDS) plot showing the clustering of samples based on gene expression data. Each point represents a sample, colored according to the treatment group, cyan = Smo-knockout (MUT), pink = wild type (WT). The plot reveals clear separation of the knockout and wild type samples, indicating significant differences in gene expression between the two groups.
# define the groups
samples <- data.frame(
lapply(colnames(normalized_counts), FUN=function(x) {
if (grepl("MUT", x)) {
# If the column name contains "MUT", extract the characters following "MUT"
"MUT"
} else if (grepl("WT", x)) {
# If the column name contains "WT", extract the characters following "WT"
"WT"
}
})
)
# set row and column names
colnames(samples) <- colnames(normalized_counts)
rownames(samples) <- c("genotype")
samples <- data.frame(t(samples))
#inspect
samples[1:8,]
## [1] "MUT" "MUT" "MUT" "MUT" "WT" "WT" "WT" "WT"
# create a design matrix of a linear model
design <- model.matrix(~ samples$genotype)
# inspect
kable(design[2:7,], type ="html")
| (Intercept) | samples$genotypeWT | |
|---|---|---|
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |
| 5 | 1 | 1 |
| 6 | 1 | 1 |
| 7 | 1 | 1 |
# create a DGEList object
dge <- DGEList(counts = normalized_counts, group = factor(rep(c("MUT", "WT"), each = 4)))
# filter low-expressed genes
keep <- rowSums(cpm(dge) > 1) >= 2
dge <- dge[keep,]
# normalize data
dge <- calcNormFactors(dge)
# estimate dispersion
dge <- estimateDisp(dge)
# estimate tagwise dispersion
tagwise_disp <- estimateTagwiseDisp(dge)
# generate MV plot
plotMeanVar(dge,
show.raw.vars = TRUE,
show.tagwise.vars = TRUE,
NBline = TRUE,
show.ave.raw.vars = TRUE,
show.binned.common.disp.vars = TRUE,
main = "Mean-Variance Plot of Smo(flox/-) mutants and wild type")
# add legend
legend("topleft",
legend=c("Raw Data", "Tagwise Dispersion", "Average Raw Variances",
"Binned Common Dispersion", "Negative Binomial Line"),
col = c("grey", "lightblue", "maroon", "red", "dodgerblue2"),
pch=c(1,1,4,4,NA), lty=c(0,0,0,0,1), lwd=c(1,1,1,1,2), cex=0.6)
Figure 2. Mean-variance plot showing the distribution of data. The dispersion and variance of the data follows the negative binomial distribution, in which the raw data and the blue negative binomial line alligns.
# create a matrix
expressionMatrix <- as.matrix(normalized_counts[,1:8])
# Set row and column names
rownames(expressionMatrix) <- rownames(normalized_counts)
colnames(expressionMatrix) <- colnames(normalized_counts)[1:8]
# Create a minimal ExpressionSet object to use with limma
minimalSet <- ExpressionSet(assayData=expressionMatrix)
# Fit a linear model
fit <- lmFit(normalized_counts, design)
# Use eBayes to estimate the posterior distribution of log-fold changes
# and calculate moderated t-statistics for each gene
fit2 <- eBayes(fit, trend=TRUE)
# Get the top differentially expressed genes
top_genes <- topTable(fit2,
coef=ncol(design),
adjust.method = "BH",
number = nrow(expressionMatrix))
# merge gene ids to topfit table
output_hits <- merge(rownames(normalized_counts),
top_genes,
by.y=0,by.x=1,
all.y=TRUE)
# sort by p-value
output_hits <- output_hits[order(output_hits$P.Value),]
# inspect
kable(output_hits[1:5,1:7],type="html",row.names = FALSE)
| x | logFC | AveExpr | t | P.Value | adj.P.Val | B |
|---|---|---|---|---|---|---|
| Smo | 1703.9076 | 1389.3918 | 49.100727 | 0.0e+00 | 0.0000000 | -4.445771 |
| Plxna4 | -292.1647 | 433.8474 | -13.459112 | 1.0e-07 | 0.0005542 | -4.453819 |
| Chd3 | 954.4752 | 4525.8799 | 9.104120 | 3.8e-06 | 0.0083446 | -4.463050 |
| Spock2 | -772.6805 | 2987.0129 | -9.100051 | 3.8e-06 | 0.0083446 | -4.463064 |
| Dock11 | 467.8391 | 1359.7519 | 8.960042 | 4.4e-06 | 0.0083446 | -4.463567 |
# Rename the column name of gene
colnames(output_hits)[1] <- "gene"
# calculate the number of gene pass the threshold p-value < 0.05
length(which(output_hits$P.Value < 0.05))
## [1] 1027
# How many genes pass correction (multipole hypothesis testing) ?
length(which(output_hits$adj.P.Val < 0.05))
## [1] 49
# Create a vector of colors for each point in the volcano plot
colors <- ifelse(output_hits$P.Value < 0.05 & output_hits$logFC > 1, "red",
ifelse(output_hits$P.Value < 0.05 & output_hits$logFC < -1, "green", "grey"))
# Plot the volcano plot with colored points
plot(output_hits$logFC, -log10(output_hits$P.Value), pch=20, col=colors,
xlab="log2(Fold Change)", ylab="-log10(P-value)",
main="Volcano Plot of Smo(flox/-) mutants and wild type")
# Add legend for the colors
legend("topright", legend=c("Up-regulated", "Down-regulated", "Non-significant"),
col=c("red", "green", "grey"), pch=20, cex=0.8)
# Label genes with over top 10 logFC
# Subset top 10 genes with largest absolute logFC values
top10_genes <- subset(output_hits,
abs(logFC) > sort(abs(logFC), decreasing = TRUE)[10])
# Add text labels to the top 10 logFC points in volcano plot
with(output_hits, {
text(x = logFC,
y = -log10(P.Value),
labels = ifelse(gene %in% top10_genes$gene, gene, ""),
pos = 4, cex = 0.5, srt=45, col="blue")
})
Figure 3. Volcano plot for differential gene expression analysis between 4 Smo(flox/-) mutants(MUT) and 4 wild type (WT) samples. Each point represents a gene, with red points indicating upregulated genes, green points indicating down-regulated genes, and grey points indicating non-significant genes. The gene of interest, Smo, is highlighted in blue. The plot shows significant differential expression of several genes, with Smo showing a log2 fold change of X and a p-value of Y
# Get top hit genes based on p-value and log fold change
top_hits <- output_hits$gene[output_hits$P.Value < 0.05 & abs(output_hits$logFC) > 2]
# Calculate logCPM values for all genes
hm_matrix <- log2(normalized_counts + 1) # Add 1 to avoid log transformation of zero counts.
# Subset matrix to include only top hit genes
heatmap_matrix_tophits <- t(scale(
t(hm_matrix[which(rownames(hm_matrix) %in% top_hits),]),
center = TRUE, scale = TRUE))
# Choose colors for heatmap
heatmap_color <- colorRamp2(c(min(heatmap_matrix_tophits), 0, max(heatmap_matrix_tophits)),
c("#00FFFF", "white", "#FF00BF"))
# Create heatmap with dendrograms for genes and samples
current_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
show_row_dend = TRUE,
show_column_dend = TRUE,
col = heatmap_color,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
column_names_gp = grid::gpar(fontsize = 6),
heatmap_legend_param = list(title = "Expression level"))
draw(current_heatmap,
column_title=
"Gene expression between Smo(flox/-) mutants (MUT) and wild type (WT) ",
column_title_gp=grid::gpar(fontsize=12))
Figure 4. Heatmap representation of gene expression analysis between 4 Smo(flox/-) mutants (MUT) and 4 wild type (WT) samples, showing log2 counts per million (logCPM) values for top hit genes identified by differential expression analysis. Rows represent genes and columns represent samples, with pink indicating higher expression and cyan indicating lower expression. The dendrograms on the top and left show clustering of genes and samples, respectively. The heatmap highlights distinct patterns of gene expression between the Smo(flox/-) mutants and control samples, with several key genes showing significant up- or down-regulation.
Run a thresholded gene set enrichment analysis (GSEA) with your significantly up-regulated and down-regulated set of genes using the g:Profiler.
# Create a vector of gene symbols for the up-regulated and down-regulated genes
upregulated <- output_hits[output_hits$logFC > 0 & output_hits$P.Value < 0.05, ]
downregulated <- output_hits[output_hits$logFC < 0 & output_hits$P.Value < 0.05, ]
upregulated_genes <- upregulated$gene
downregulated_genes <- downregulated$gene
# analyze for all differentially expressed genes
enrich <- gost(query = output_hits$gene, organism = "mmusculus",
ordered_query = TRUE,
measure_underrepresentation = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
# analyze for up-regulated differentially expressed genes
enrich_upreg <- gost(query = upregulated_genes, organism = "mmusculus",
ordered_query = TRUE,
measure_underrepresentation = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
# analyze for down-regulated differentially expressed genes
enrich_downreg <- gost(downregulated_genes, organism = "mmusculus",
ordered_query = TRUE,
measure_underrepresentation = TRUE,
correction_method = "fdr",
sources = c("GO:BP", "REAC", "WP"))
# extract the enriched gene sets with a p-value < 0.05
enriched <- enrich$result[enrich$result$p_value < 0.05, ]
enriched_upreg <- enrich_upreg$result[enrich_upreg$result$p_value < 0.05, ]
enriched_downreg <- enrich_downreg$result[enrich_downreg$result$p_value < 0.05, ]
# inspect top 5
knitr::kable(head(enriched[,3:11], 5), format = "html",
table.attr = 'style="width: 30%; border-collapse: separate; border-spacing: 15px;"')
| p_value | term_size | query_size | intersection_size | precision | recall | term_id | source | term_name |
|---|---|---|---|---|---|---|---|---|
| 0 | 27334 | 10 | 9 | 0.9000000 | 0.0003293 | GO:0008150 | GO:BP | biological_process |
| 0 | 1508 | 10495 | 30 | 0.0028585 | 0.0198939 | GO:0007606 | GO:BP | sensory perception of chemical stimulus |
| 0 | 1288 | 10716 | 21 | 0.0019597 | 0.0163043 | GO:0009593 | GO:BP | detection of chemical stimulus |
| 0 | 1196 | 10495 | 17 | 0.0016198 | 0.0142140 | GO:0007608 | GO:BP | sensory perception of smell |
| 0 | 1338 | 10674 | 44 | 0.0041222 | 0.0328849 | GO:0050906 | GO:BP | detection of stimulus involved in sensory perception |
knitr::kable(head(enriched_upreg[,3:11], 5), format = "html",
table.attr = 'style="width: 30%; border-collapse: separate; border-spacing: 15px;"')
| p_value | term_size | query_size | intersection_size | precision | recall | term_id | source | term_name |
|---|---|---|---|---|---|---|---|---|
| 0.0e+00 | 27334 | 8 | 7 | 0.8750000 | 0.0002561 | GO:0008150 | GO:BP | biological_process |
| 0.0e+00 | 3625 | 585 | 13 | 0.0222222 | 0.0035862 | GO:0006396 | GO:BP | RNA processing |
| 1.4e-06 | 1572 | 455 | 2 | 0.0043956 | 0.0012723 | GO:0022613 | GO:BP | ribonucleoprotein complex biogenesis |
| 1.7e-06 | 1633 | 507 | 4 | 0.0078895 | 0.0024495 | GO:0006397 | GO:BP | mRNA processing |
| 5.8e-06 | 1449 | 536 | 4 | 0.0074627 | 0.0027605 | GO:0000375 | GO:BP | RNA splicing, via transesterification reactions |
knitr::kable(head(enriched_downreg[,3:11], 5), format = "html",
table.attr = 'style="width: 30%; border-collapse: separate; border-spacing: 15px;"')
| p_value | term_size | query_size | intersection_size | precision | recall | term_id | source | term_name |
|---|---|---|---|---|---|---|---|---|
| 0.0000000 | 27334 | 155 | 154 | 0.9935484 | 0.0056340 | GO:0008150 | GO:BP | biological_process |
| 0.0004051 | 2350 | 395 | 9 | 0.0227848 | 0.0038298 | GO:0007186 | GO:BP | G protein-coupled receptor signaling pathway |
| 0.0000000 | 8407 | 3 | 2 | 0.6666667 | 0.0002379 | REAC:0000000 | REAC | REACTOME root term |
| 0.0000000 | 1619 | 427 | 31 | 0.0725995 | 0.0191476 | REAC:R-MMU-168256 | REAC | Immune System |
| 0.0000000 | 1559 | 411 | 30 | 0.0729927 | 0.0192431 | REAC:R-MMU-392499 | REAC | Metabolism of proteins |
# inspect top one from three data sources for all genes
knitr::kable(rbind(enriched[enriched$source == "GO:BP",][1,9:13],
enriched[enriched$source == "REAC",][1,9:13],
enriched[enriched$source == "WP",][1,9:13]),
format = "html",
table.attr = 'style="width: 30%; border-collapse: separate; border-spacing: 15px;"')
| term_id | source | term_name | effective_domain_size | source_order | |
|---|---|---|---|---|---|
| 1 | GO:0008150 | GO:BP | biological_process | 27334 | 3212 |
| 73 | REAC:R-MMU-173623 | REAC | Classical antibody-mediated complement activation | 8407 | 293 |
| 107 | WP:WP33 | WP | Fatty acid omega-oxidation | 4529 | 25 |
# inspect top one from three data sources for up-regulated genes
knitr::kable(rbind(enriched_upreg[enriched_upreg$source == "GO:BP",][1,9:13],
enriched_upreg[enriched_upreg$source == "REAC",][1,9:13],
enriched_upreg[enriched_upreg$source == "WP",][1,9:13]),
format = "html",
table.attr = 'style="width: 30%; border-collapse: separate; border-spacing: 15px;"')
| term_id | source | term_name | effective_domain_size | source_order | |
|---|---|---|---|---|---|
| 1 | GO:0008150 | GO:BP | biological_process | 27334 | 3212 |
| 15 | REAC:0000000 | REAC | REACTOME root term | 8407 | 1718 |
| 56 | WP:000000 | WP | WIKIPATHWAYS | 4529 | 1 |
# inspect top one from three data sources for down-regulated genes
knitr::kable(rbind(enriched_downreg[enriched_downreg$source == "GO:BP",][1,9:13],
enriched_downreg[enriched_downreg$source == "REAC",][1,9:13],
enriched_downreg[enriched_downreg$source == "WP",][1,9:13]),
format = "html",
table.attr = 'style="width: 30%; border-collapse: separate; border-spacing: 15px;"')
| term_id | source | term_name | effective_domain_size | source_order | |
|---|---|---|---|---|---|
| 1 | GO:0008150 | GO:BP | biological_process | 27334 | 3212 |
| 3 | REAC:0000000 | REAC | REACTOME root term | 8407 | 1718 |
| 21 | WP:000000 | WP | WIKIPATHWAYS | 4529 | 1 |
The top term returned in each data source for all genes: * GO:BP: biological_process * REAC: Classical antibody-mediated complement activation * WP: Fatty acid omega-oxidation
The top term returned in each data source for both up- and
down-regulated genes: * GO:BP:
biological_process * REAC: REACTOME root term
* WP: WIKIPATHWAYS
# Three interative figures were generate by g:profiler::gostplot
# Generate a Manhattan plot to visualize the distribution of the top genesets from each data source.
gostplot(enrich) %>% plotly::layout(title = "Manhattan plot for genes", font = list(size = 10))
# Generate a Manhattan plot to visualize the distribution of the top genesets from each data source.
gostplot(enrich_upreg) %>% plotly::layout(title = "Manhattan plot for Upregulated genes", font = list(size = 10))
# Generate a Manhattan plot to visualize the distribution of the top genesets from each data source.
gostplot(enrich_downreg) %>% plotly::layout(title = "Manhattan plot for Downregulated genes", font = list(size = 10))
Figure 5. Manhattan plot displaying the distribution of the top genesets from each data source for downregulated genes of 4 Smo(flox/-) mutants (MUT) and 4 wild type (WT) samples. A extensive number of REAC genesets is returned compared to GP:BP and WP, and genesets of biological_process, REACTOME root term, and WIKIPATHWAYS are exceeding 16 -log10 correction threshold.
Figure 6. Manhattan plot displaying the distribution of the top genesets from each data source for up-regulated genes of 4 Smo(flox/-) mutants (MUT) and 4 wild type (WT) samples. A extensive number of GP:BP genesets is returned compared to REAC and WP, and the genesets of RNA processing, biological_process, Metabolism, REACTOME root term, and WIKIPATHWAYS are exceeding 16 -log10 correction threshold.
Figure 7. Manhattan plot displaying the distribution of the top genesets from each data source for downregulated genes of 4 Smo(flox/-) mutants (MUT) and 4 wild type (WT) samples. A extensive number of REAC genesets is returned compared to GP:BP and WP, and genesets of biological_process, REACTOME root term, and WIKIPATHWAYS are exceeding 16 -log10 correction threshold.
nrow(enriched)
## [1] 107
nrow(enriched_upreg)
## [1] 69
nrow(enriched_downreg)
## [1] 31
Rowan CJ, Li W, Martirosyan H, Erwood S, Hu D, Kim YK,
Sheybani-Deloui S, Mulder J, Blake J, Chen L, Rosenblum ND.
[Hedgehog-GLI signaling in Foxd1-positive stromal cells promotes murine
nephrogenesis via TGF&beta signaling.] Development 1
July 2018; 145 (13): dev159947. (doi: 10.1242/dev.159947).
Miura Y. The biological significance of ω-oxidation of fatty acids. Proc Jpn Acad Ser B Phys Biol Sci. 2013;89(8):370-82. (doi: 10.2183/pjab.89.370.)
Chighizola CB, Lonati PA, Trespidi L, Meroni PL, Tedesco F. The Complement System in the Pathophysiology of Pregnancy and in Systemic Autoimmune Rheumatic Diseases During Pregnancy. Front Immunol. 2020 Aug 27;11:2084. (doi: 10.3389/fimmu.2020.02084.)